---
title: "flexdashboard2"
output:
  flexdashboard::flex_dashboard:
    storyboard: true
    social: menu
    source: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
```
### Introduction
Research Question:
For all seasons of Jeopardy, from Season 1 through Season 35, what are the most common calendar years referenced in an answer?
Introduction:
Knowing which years have been commonly referenced in past seasons of Jeopardy is interesting from a human-interest or anthropological viewpoint and is potentially advantageous to future Jeopardy contestants. Regarding the former perspective, the years that are frequently referenced give us an idea of the years in which people are most interested. There is potential bias in what we consider memorable years. Often, we tend to recognize the years in which major historical events occurred. However, although those years may be significant because they contain a single, prominent event, that does not necessarily mean they will be the years most commonly referenced in Jeopardy answers. This leads into the latter perspective. There are years in which multiple smaller, but nonetheless noteworthy, events occurred. For years that contain many smaller remarkable events, the total number of times those years are referenced in Jeopardy answers may be greater. Thus, potential contestants would benefit from learning about those years rather than years with fewer, but perhaps more major, events. Contestants may find it useful to know whether a few standout years tend to surface more often in Jeopardy answers. Furthermore, it is interesting to note whether answers are more concerned with years associated with events such as major military conflicts or with more contemporary activities such as pop-culture events. Again, this provides insight into what people deem important as well as a potential benefit to contestants deciding which years should receive their preparatory attention.
Methods:
We began by setting up an R Markdown project and downloading the questions and answers from the Jeopardy seasons. All of these seasons were available on jwolle1’s GitHub in a dataset called “jeopardy_clue_dataset” that contained 350,000 Jeopardy clues from Season 1 through the end of Season 35 (Jwolle1, 2019). We used three primary packages, tidyverse (Wickham, 2017), ggthemes (Arnold, 2019) and knitr (Xie, 2019), as well as two types of projects, R Markdown (Allaire et al., 2019) and Flexdashboard (Iannone, Allaire & Borges, 2018). We used regular expressions for pattern searching and plotting functions for visualization. We read in the data and selected the three columns of interest: category, answer, and question. To extract the years referenced in the Jeopardy answers, we created a new column using the mutate function and extracted the first four-digit number in each answer using the str_extract function from the stringr package within tidyverse. We then used filter and str_count on the new column to keep only entries that contained exactly four digits, which also dropped answers with no four-digit number. A few four-digit numbers were not years, so we restricted the dataframe to values between 1000 and 2019. To further simplify the dataset, we created another dataframe with two columns: each year and the total number of times that year was referenced.
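The extraction step can be sketched on a few made-up answers. This is only an illustration under stated assumptions: the strings in `toy_answers` are hypothetical and not taken from the dataset, and the word-boundary pattern is an alternative we did not use in our analysis. It shows that `str_extract` returns only the first match per string and that the pattern `[0-9]{4}` also matches the first four digits of a longer number.

```r
library(stringr)

# Hypothetical answers, invented for illustration (not from the dataset)
toy_answers <- c(
  "In 1996 Atlanta hosted the Summer Olympics",
  "This river is about 2300 miles long",
  "He flew 12000 feet above the city",
  "No number here",
  "From 1914 to 1918"
)

# str_extract() returns the first match per string, or NA when there is none
first_four <- str_extract(toy_answers, "[0-9]{4}")

# Alternative pattern (an assumption, not the one used in our analysis):
# word boundaries reject four digits that sit inside a longer number
bounded <- str_extract(toy_answers, "\\b[0-9]{4}\\b")

# Convert to numeric and keep only plausible years (1000-2019)
years <- as.numeric(bounded)
plausible_years <- years[!is.na(years) & years >= 1000 & years <= 2019]
```

With the unbounded pattern, "12000" yields "1200", which would pass a 1000-2019 range check even though it is not a year. The bounded pattern returns NA for that string.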
One issue that arose was the way in which R was reading the “years” column. Initially, the column was a character vector. However, in order to use those values for plotting, we needed to convert it to a numeric vector. Simply calling the as.numeric function on the column was insufficient because it returned a new, separate vector rather than converting the column within our primary dataset. To counter this issue, we used the as.numeric function within the transform function. Prior to transforming it, however, we grouped the dataset by year and summarized it into a new column, “count,” that counted the number of times each year was used in a Jeopardy answer. Based on these count values, we arranged the years in descending order and determined the ten years that were most frequently referenced. Later, we created a new dataframe that contained only the top ten years and the number of times each of those years was given in an answer.
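The conversion and counting steps described above could also be written as one pipeline. The sketch below is a minimal example that assumes a small made-up `toy` tibble rather than our dataset, and it uses `mutate` and `count` from dplyr as one possible alternative to the `transform` and `summarize` calls in our code. Assigning the result of `as.numeric` back to the column inside `mutate` is what updates the column in place.

```r
library(dplyr)

# Hypothetical extracted years, stored as character as in our first attempt
toy <- tibble(years = c("1996", "1999", "1996", "1980", "1999", "1996"))

year_counts <- toy %>%
  # as.numeric() alone returns a new vector; mutate() writes it back
  mutate(years = as.numeric(years)) %>%
  # count() groups by year, tallies rows, and sorts by frequency
  count(years, name = "count", sort = TRUE)

# Keep the most frequent years (two here; ten in the report)
top_two <- slice_head(year_counts, n = 2)
```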
Using ggplot, we visualized our results. We created bar charts, a scatterplot, and a histogram. After inspecting each graph, we determined that the scatterplot best answered our question visually because it let the viewer most clearly see the relationship between each year and the total number of times it appeared in the Jeopardy answers. We then turned to Flexdashboard. We used our scatterplot as the basis for a plot with vertical lines, drawn with geom_vline from ggplot2, marking eight major historical events that occurred within a truncated range of years (1800-2019). Each world event was labeled using the geom_text function, and the plot was styled with theme_few from the ggthemes package. We added two additional lines in a different color to mark the year Jeopardy first aired (1964) and the year R was released (2000). After that, we generated a second graph: a line graph, using geom_line, for the entire dataset of all years.
After writing all of our code in R Markdown, we opened a new Flexdashboard project. Here, we created an interactive visualization of our two most important plots. For our first plot, we included the subsetted scatterplot with reference lines. The second plot we incorporated was the line graph of all of the years used in Jeopardy answers.
We performed additional statistical analyses on the
Results:
From our analyses, we found that 1996 was the most frequently referenced year in Jeopardy when the answers from all 35 seasons were pooled. The 1990s were the most frequently referenced decade. We ranked the top 10 years, in descending order, by reference frequency: 1996, 1999, 1997, 1980, 2000, 1960, 1995, 1990, 1998, and 1970. These years were referenced 1087, 1024, 1023, 999, 994, 975, 961, 906, 891, and 868 times, respectively.
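The decade comparison can be illustrated with the ten counts reported above. This sketch covers only those ten years, not the full dataset, so the totals are a partial check rather than the complete decade counts. The `decade` column is a helper introduced here for illustration.

```r
library(dplyr)

# The top ten years and their counts, as reported in the Results
top_ten <- tibble(
  years = c(1996, 1999, 1997, 1980, 2000, 1960, 1995, 1990, 1998, 1970),
  count = c(1087, 1024, 1023, 999, 994, 975, 961, 906, 891, 868)
)

by_decade <- top_ten %>%
  # Round each year down to the start of its decade
  mutate(decade = floor(years / 10) * 10) %>%
  group_by(decade) %>%
  summarize(total = sum(count)) %>%
  arrange(desc(total))
```

Six of the ten years fall in the 1990s, and together they account for 5892 references, far more than any other decade represented in the top ten.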
Conclusions:
The 1990s appear to be the decade most often referenced across a vast array of Jeopardy answers, with 1996 being the most common year referenced since the debut of Jeopardy in 1964. Our analysis therefore answered our research question for those interested in a deeper understanding of Jeopardy. This particular research question, although of interest to our group, does not seem to have been explored previously. To our knowledge, there is no published data on this question. From our analysis in R, we found that 24% of all answers contain a year, and further analysis revealed 1% of the years was 1996 bei
The looming question remains: why did the 1990s, and more specifically 1996, turn out to be the most commonly referenced? Why is 1996 so interesting? That is yet another question to be explored and answered by young, enthusiastic R entrepreneurs.
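Shares like those quoted in the Conclusions can be computed by taking the mean of a logical vector. The example below is a sketch on four made-up answers, so the resulting proportions are not the figures from our analysis.

```r
library(stringr)

# Hypothetical answers, invented for illustration
toy_answers <- c(
  "Signed in 1776",
  "The Eiffel Tower",
  "Apollo 11 landed in 1969",
  "A kind of pasta"
)

years <- str_extract(toy_answers, "[0-9]{4}")

# Share of answers that contain a four-digit number
share_with_year <- mean(!is.na(years))

# Among answers with a year, the share equal to one particular year
share_1969 <- mean(years[!is.na(years)] == "1969")
```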
References
Allaire J, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2019). rmarkdown: Dynamic Documents for R. R package version 1.18.
Arnold, Jeffrey B. (2019). ggthemes: Extra Themes, Scales and Geoms for 'ggplot2'. R
package version 4.2.0. https://CRAN.R-project.org/package=ggthemes.
Iannone, Richard, JJ Allaire and Barbara Borges (2018). flexdashboard: R Markdown Format
for Flexible Dashboards. R package version 0.5.1.1.
Jwolle1 (2019). Jeopardy_clue_dataset. Jeopardy Productions.
Wickham, Hadley (2017). tidyverse: Easily Install and Load the 'Tidyverse'. R package
version 1.2.1. https://CRAN.R-project.org/package=tidyverse.
Xie, Yihui (2019). knitr: A General-Purpose Package for Dynamic Report Generation in R.
R package version 1.25.
### Jeopardy Reference Plot
```{r}
library(tidyverse)
library(ggthemes)  # provides theme_few()

# Hide code in the rendered dashboard
knitr::opts_chunk$set(echo = FALSE)

clues <- read_tsv("./data/master_season1-35.tsv")

answers <- clues %>%
  select(category, answer, question)

# Extract the first four-digit number from each answer and
# drop answers that do not contain one
years_data <- answers %>%
  mutate(years = str_extract(answer, "[0-9]{4}")) %>%
  filter(!is.na(years))

# Count how often each extracted value appears, most frequent first
years_data1 <- years_data %>%
  group_by(years) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  ungroup()

# Convert years to numeric and keep only plausible years (1000-2019).
# The range filter replaces the hard-coded slice(413:84470), which
# depended on row positions after sorting.
years2 <- transform(years_data1, years = as.numeric(years)) %>%
  filter(between(years, 1000, 2019))

# Historical events to mark; y is the height at which each label starts
events <- tribble(
  ~year, ~label,                ~y,
  1861,  "American Civil War",  350,
  1914,  "WWI",                 500,
  1939,  "WWII",                500,
  1955,  "Vietnam",             100,
  1963,  "JFK Assassination",   100,
  1986,  "Challenger Disaster", 175,
  1996,  "US Olympics",         200,
  2001,  "9/11 Attacks",        200   # was drawn at 2008; the attacks were in 2001
)

reference_plot <-
  ggplot() +
  geom_point(data = years2, aes(x = years, y = count)) +
  coord_cartesian(xlim = c(1800, 2019)) +
  geom_vline(data = events, aes(xintercept = year), colour = "blue") +
  # Rotated labels sit just to the left of each event line
  geom_text(data = events, aes(x = year - 1, y = y, label = label),
            angle = 90, vjust = 0, hjust = 0) +
  # Mark the peak year; size is given in points via ggplot2's .pt constant
  annotate("text", x = 1996, y = 1087, label = "1996",
           vjust = 0, hjust = 0, size = 11 / .pt) +
  # Red lines: Jeopardy first aired (1964) and R was released (2000)
  geom_vline(xintercept = c(1964, 2000), colour = "red") +
  xlab("Years") +
  ylab("Number of times year given in clue") +
  ggtitle("Four-digit years found in Jeopardy clues, Seasons 1-35",
          subtitle = "Truncated from 1800-2019") +
  theme_few()

reference_plot
```
### Chart C
```{r}
# Line graph of reference counts across all extracted years
line_plot <- ggplot(years2, aes(x = years, y = count)) +
  geom_line() +
  xlab("Years") +
  ylab("Number of times year given in clue") +
  ggtitle("Four-digit years found in Jeopardy clues, Seasons 1-35") +
  theme_few()

line_plot
```